A fundamental problem in the analysis of network data is the detection of network communities, 
groups of densely interconnected nodes, which may be overlapping or disjoint. Here we describe 
a method for finding overlapping communities based on a principled statistical approach using 
generative network models. We show how the method can be implemented using a fast, closed- 
form expectation-maximization algorithm that allows us to analyze networks of millions of nodes in 
reasonable running times. We test the method both on real-world networks and on synthetic bench- 
marks and find that it gives results competitive with previous methods. We also show that the same 
approach can be used to extract nonoverlapping community divisions via a relaxation method, and 
demonstrate that the algorithm is competitively fast and accurate for the nonoverlapping problem. 

I. INTRODUCTION 

Many networked systems, including biological and so- 
cial networks, are found to divide naturally into modules 
or communities, groups of vertices with relatively dense 
connections within groups but sparser connections between
them [1, 2]. Depending on context, the groups 
may be disjoint or overlapping. A fundamental problem 
in the theory of networks, and one that has attracted 
substantial interest among researchers in the last decade, 
is how to detect such communities in empirical network 
data [3, 4]. There are a number of desirable properties 
that a good community detection scheme should have. 
First, it should be effective, meaning it should be able to 
accurately detect community structure when it is present. 
There are, for instance, many examples of networks, both 
naturally occurring and synthetic, for which the commu- 
nity structure is widely agreed upon, and a successful 
detection method should be able to find the accepted 
structure in such cases. Second, methods based on sound 
theoretical principles are preferable over those that are 
not. A method based on a mere hunch that something 
might work is inherently less trustworthy than one based 
on a provable result or fundamental mathematical in- 
sight. Third, when implemented as a computer algo- 
rithm, a method should ideally be fast and scale well 
with the size of the network analyzed. Many of the net- 
works studied by current science are large, with millions 
or even billions of vertices, so a community detection al- 
gorithm whose running time scales, say, linearly with the 
size of the network is enormously preferred over one that 
scales as size squared or cubed. 

In this paper we derive and demonstrate an algorithm 
for community detection that can find either overlap- 
ping or nonoverlapping communities and satisfies all of 
the demands above. On standard benchmark tests the 
algorithm has performance similar to the best previ- 
ous algorithms in detecting known community structure. 
The algorithm is based on established methods of sta- 
tistical inference, namely maximum likelihood and the 
expectation-maximization algorithm. And the algorithm 
is fast. In its simplest form it consists of the iteration 
of just two sets of equations, each iteration taking an 
amount of time that increases only linearly with system 
size. In practice the algorithm can handle networks with 
millions of vertices and edges in reasonable running times 
on a typical desktop computer. The largest network we 
have analyzed has over 4 million vertices and 40 million 
edges. 

We approach the problem of community detection first 
as a problem of finding overlapping communities. Early 
efforts at community detection, going back to the 1970s, 
assumed nonoverlapping or disjoint communities, but as 
many researchers have argued in the last few years, it is 
common in practical situations for communities to over- 
lap. In social networks, for example, people often belong 
to more than one circle of acquaintances — family, friends, 
coworkers, and so forth — and hence those circles should 
properly be considered as overlapping, since they have at 
least one common member. In biological networks, too, 
vertices can belong to more than one group. Metabolites 
in a metabolic network can play a role in more than 
one metabolic process or cycle; species in a food web can 
fall on the border between two otherwise noninteracting 
subcommunities and play a role in both of them. Thus 
the most general formulation of the community detection 
problem should allow for the possibility of overlap. Our 
approach is to develop a solution to this general problem 
first, then show how a variant of the same approach can 
be applied to nonoverlapping communities as well. 

We tackle the detection of overlapping communities by 
fitting a stochastic generative model of network struc- 
ture to observed network data. This approach, which 
applies methods of statistical inference to networks, has 
been explored by a number of authors for the nonover- 
lapping case, including some work that goes back several 
decades [5-7]. Extending the same approach to the overlapping 
case, however, has proved nontrivial. The crucial 
step is to devise a generative model that produces net- 
works with overlapping community structure similar to 
that seen in real networks. The models used in most previous 
work are "mixed membership" models [8], in which, 
typically, vertices can belong to multiple groups and two 
vertices are more likely to be connected if they have more 
than one group in common. This, however, implies that 
the area of overlap between two communities should have 
a higher average density of edges than an area that falls 
in just a single community. It is unclear whether this 
reflects the behavior of real-world networks accurately, 
but it is certainly possible to construct networks that do 
not have this type of structure. Ideally we would pre- 
fer a less restrictive model that makes fewer assumptions 
about the structure of community overlaps. 

Another set of approaches to the detection of over- 
lapping communities are those based on local commu- 
nity structure. Rather than splitting an entire network 
into communities in one step, these methods instead look 
for local groups within the network, based on analy- 
sis of local connection patterns and ignoring global net- 
work structure. Methods of this kind give rise naturally 
to overlapping communities when one generates a large 
number of independent local communities throughout the 
network. Moreover, the communities tend to be compact 
and connected subgraphs, a requirement not always met 
by other methods. On the other hand, global detection 
methods can capture large-scale network structure bet- 
ter and are more appropriate when particular constraints, 
such as constraints on the number of communities, must 
be satisfied. 

In this paper, we develop a global statistical method 
for detecting overlapping communities based on the idea 
of link communities, which has been proposed independently 
by a number of authors both in the physics literature 
and in machine learning [11, 12]. The idea 
is that communities arise when there are different types 
of edges in a network. In a social network, for instance, 
there are links representing family ties, friendship, pro- 
fessional relationships, and so forth. If we can identify 
the types of the edges, i.e., if we can cluster not the ver- 
tices in a network but the edges, then we can deduce the 
communities of vertices after the fact from the types of 
edges connected to them. This approach has the nice 
feature of matching our intuitive idea of the origin and 
nature of community structure while giving rise to over- 
lapping communities in a natural way: a vertex belongs 
to more than one community if it has more than one type 
of edge. 

Previous approaches to the discovery of link communities 
have made use of heuristic quality functions optimized 
over possible partitions of a network's edges [9, 10]. 
Such quality functions, particularly the so-called modularity 
function [13], have been used in the past for 
nonoverlapping communities, but while in practice these 
functions often give reasonable results, they also have 
some deficiencies: the modularity, for instance, cannot be 
used to find very small communities [14], may not have 
a unique optimum [15], and is somewhat unsatisfactory 
from a formal viewpoint [16]. Recent results of Bickel 
and Chen [17] suggest that these deficiencies can be remedied 
by abandoning the quality function approach and 
instead fitting a generative model to the data. This is 
the approach we take, but the definition of a model for 
link communities demands some subtlety. In generative 
models for vertex communities one can assign vertices to 
groups first and then place edges based on that assign- 
ment. But for a model of link communities, where it is 
the edges that are partitioned, one cannot assign edges to 
groups until the edges exist, so the edges and their group- 
ings have to be generated simultaneously. We describe in 
detail how we achieve this in the following section. Once 
we have the model, the goal will be to determine the val- 
ues of its parameters that best fit the observed network 
and from those to determine the overlapping vertex com- 
munities. 

The outline of the paper is as follows. First we define 
our model and then demonstrate how the best-fit values 
of its parameters can be calculated using a maximum 
likelihood algorithm. In its simplest form this algorithm 
is only moderately fast, but we demonstrate that many 
of the model parameters converge rapidly to trivial val- 
ues and hence can be pruned from the calculation. We 
give a prescription for performing this pruning, resulting 
in a significantly faster algorithm which is practical for 
applications to very large networks. We give example 
applications to numerous real-world networks, as well as 
tests against synthetic networks that demonstrate that 
the algorithm can discover known overlapping commu- 
nity structure in such networks. 

Finally, we show how our method can be used also 
to detect nonoverlapping communities by assigning each 
vertex solely to the community to which it most strongly 
belongs in the overlapping division. We demonstrate that 
this intuitive heuristic can be justified rigorously by re- 
garding the link community model as a relaxation of a 
stochastic blockmodel for disjoint communities [18]. Algorithms 
have been proposed previously for fitting this 
blockmodel, but their running time was always at least 
quadratic in the number of vertices, which limited their 
application to smaller networks. The algorithm proposed 
here is significantly faster and hence can be applied to the 
detection of disjoint communities in very large networks. 



II. A GENERATIVE MODEL FOR LINK 
COMMUNITIES 

Our first step is to define the generative network model 
that we will use. The model generates networks with a 
given number n of vertices and undirected edges divided 
among a given number K of communities. It is con- 
venient to think of the edges as being colored with K 
different colors to represent the communities to which 
they belong. Then the model is parametrized by a set 
of parameters $\theta_{iz}$, which represent the propensity of vertex 
i to have edges of color z. Specifically, $\theta_{iz}\theta_{jz}$ is the 
expected number of edges of color z that lie between vertices 
i and j, the exact number being Poisson distributed 
about this mean value. Note that this means the network 
is technically a multigraph: it can have more than one 
edge between a pair of vertices. Most real-world networks 
have single edges only, and in this sense the model is unrealistic. 
However, allowing multiedges makes the model 
enormously simpler to treat and in practice the values of 
the $\theta_{iz}$ will be small so that the number of multiedges, 
and hence the error introduced, is also small. The same 
approximation is made in most other random graph models 
of networks, including, for instance, the widely studied 
configuration model [19, 20]. Our model also allows 
self-edges (edges that connect to the same vertex at both 
ends) with expected number $\frac{1}{2}\theta_{iz}\theta_{iz}$, the extra factor of 
a half being convenient for consistency with later results. 
Again, the appearance of self-edges, while unrealistic in 
some cases, greatly simplifies the mathematical devel- 
opments and is typical of other random graph models 
including the configuration model. 

In the model defined here the link communities arise 
implicitly as the network is generated, as discussed in 
the introduction, rather than being spelled out explicitly. 
Two vertices i, j which have large values of $\theta_{iz}$ and 
$\theta_{jz}$ for some value of z have a high probability of being 
connected by an edge of color z, and hence groups 
of such vertices will tend to be connected by relatively 
dense webs of color-z edges, precisely the structure we 
expect to see in a network with link communities. 
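
To make the generative process concrete, the following is a minimal sketch of how a network could be drawn from the model, assuming the parameters are supplied as a list of rows `theta[i][z]`. The function names `poisson` and `sample_network` are illustrative and not taken from the paper, and the loop over all vertex pairs is quadratic in n, so it is meant only for small examples.

```python
import math
import random

def poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (suitable for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_network(theta, rng):
    """Return {(i, j, z): multiplicity} with i <= j for nonnegative theta[i][z]."""
    n, K = len(theta), len(theta[0])
    edges = {}
    for i in range(n):
        for j in range(i, n):
            for z in range(K):
                mean = theta[i][z] * theta[j][z]
                if i == j:
                    mean *= 0.5  # self-edges carry the extra factor of one half
                count = poisson(mean, rng)
                if count:
                    edges[(i, j, z)] = count  # multiedges are allowed by the model
    return edges

# Hypothetical parameters: vertices 0, 1 use only color 0; vertices 2, 3 only color 1.
theta = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
edges = sample_network(theta, random.Random(1))
```

Because an edge of color z has mean zero unless both endpoints have nonzero propensity for z, no sampled edge in this example can join the two groups.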



III. DETECTING OVERLAPPING 
COMMUNITIES 

Given the model defined above, it is now straightforward 
to write down the probability with which any particular 
network is generated. Recalling that the sum of 
independent Poisson-distributed random variables is also 
a Poisson-distributed random variable, the expected total 
number of edges of all colors between two vertices i 
and j is simply $\sum_z \theta_{iz}\theta_{jz}$ (or $\frac{1}{2}\sum_z \theta_{iz}\theta_{iz}$ for self-edges), 
and the actual number is Poisson-distributed with this 
mean. Thus the probability of generating a graph G with 
adjacency matrix elements $A_{ij}$ is 

$$P(G|\theta) = \prod_{i<j} \frac{\bigl(\sum_z \theta_{iz}\theta_{jz}\bigr)^{A_{ij}}}{A_{ij}!} \exp\Bigl(-\sum_z \theta_{iz}\theta_{jz}\Bigr) \times \prod_i \frac{\bigl(\frac{1}{2}\sum_z \theta_{iz}\theta_{iz}\bigr)^{A_{ii}/2}}{(A_{ii}/2)!} \exp\Bigl(-\frac{1}{2}\sum_z \theta_{iz}\theta_{iz}\Bigr). \qquad (1)$$

(Recall that the adjacency matrix element $A_{ij}$, by convention, 
takes the value $A_{ij} = 1$ if there is an edge between 
distinct vertices i and j, but $A_{ii} = 2$ for a self-edge; 
hence the additional factors of $\frac{1}{2}$ in the second 
product.) 

We fit the model to an observed network by maximizing 
this probability with respect to the parameters $\theta_{iz}$, or 
equivalently (and more conveniently) maximizing its logarithm. 
Taking the log of Eq. (1), rearranging, and dropping 
additive and multiplicative constants (which have 
no effect on the position of the maximum), we derive the 
log-likelihood 

$$\log P(G|\theta) = \sum_{ij} A_{ij} \log\Bigl(\sum_z \theta_{iz}\theta_{jz}\Bigr) - \sum_{ijz} \theta_{iz}\theta_{jz}. \qquad (2)$$

Direct maximization of this expression by differentiating 
leads to a set of nonlinear implicit equations for $\theta_{iz}$ that 
are hard to solve, even numerically. An easier approach 
is the following. First we apply Jensen's inequality in the 
form [21]: 

$$\log\Bigl(\sum_z x_z\Bigr) \ge \sum_z q_z \log\frac{x_z}{q_z}, \qquad (3)$$

where the $x_z$ are any set of positive numbers and the 
$q_z$ are any probabilities satisfying $\sum_z q_z = 1$. Note that 
the exact equality can always be achieved by making the 
particular choice $q_z = x_z / \sum_z x_z$. Applying Eq. (3) to 
Eq. (2) gives 

$$\log P(G|\theta) \ge \sum_{ijz} \Bigl[ A_{ij} q_{ij}(z) \log\frac{\theta_{iz}\theta_{jz}}{q_{ij}(z)} - \theta_{iz}\theta_{jz} \Bigr], \qquad (4)$$

where the probabilities $q_{ij}(z)$ can be chosen in any way 
we please provided they satisfy $\sum_z q_{ij}(z) = 1$. Notice 
that the $q_{ij}(z)$ are only defined for vertex pairs i, j that 
are actually connected by an edge in the network (so 
that $A_{ij} = 1$), and hence there are only as many of them 
as there are observed edges. 
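
The role of Jensen's inequality here can be checked numerically. The short sketch below uses arbitrary illustrative values for the positive numbers x_z (they are not from the paper) and compares the lower bound for a uniform choice of q_z with the bound for the choice q_z = x_z / sum_z x_z, which should reproduce the logarithm of the sum exactly.

```python
import math

def jensen_bound(x, q):
    """Right-hand side of the inequality: sum_z q_z log(x_z / q_z)."""
    return sum(qz * math.log(xz / qz) for xz, qz in zip(x, q) if qz > 0)

x = [0.2, 1.5, 0.3]                      # any positive numbers (illustrative values)
exact = math.log(sum(x))                 # left-hand side, log(sum_z x_z)
uniform = jensen_bound(x, [1 / 3] * 3)   # an arbitrary valid choice of q
tight = jensen_bound(x, [xz / sum(x) for xz in x])  # the choice that gives equality
```

Any valid q gives a value no larger than `exact`, and the proportional choice closes the gap, which is what makes the alternating maximization over q and the model parameters equivalent to maximizing the log-likelihood itself.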

Since, as noted, the exact equality in this expression 
can always be achieved by a suitable choice of $q_{ij}(z)$, it 
follows that the double maximization of the right-hand 
side of (4) with respect to both the $q_{ij}(z)$ and the $\theta_{iz}$ 
is equivalent to maximizing the original log-likelihood, 
Eq. (2), with respect to the $\theta_{iz}$ alone. It may appear that 
this does not make our optimization problem any simpler: 
we have succeeded only in turning a single optimization 
into a double one, which one might well imagine was a 
more difficult problem. Delightfully, however, it is not; 
the double optimization is actually very simple. Given 
the true optimal values of $\theta_{iz}$, the optimal values of $q_{ij}(z)$ 
are given by 

$$q_{ij}(z) = \frac{\theta_{iz}\theta_{jz}}{\sum_z \theta_{iz}\theta_{jz}}, \qquad (5)$$

since these are the values that make our inequality an 
exact equality. But given the optimal values of the $q_{ij}(z)$, 
the optimal $\theta_{iz}$ can be found by differentiating (4), which 
gives 

$$\theta_{iz} = \frac{\sum_j A_{ij} q_{ij}(z)}{\sum_j \theta_{jz}}. \qquad (6)$$

Summing this expression over i and rearranging gives us 

$$\Bigl(\sum_i \theta_{iz}\Bigr)^2 = \sum_{ij} A_{ij} q_{ij}(z), \qquad (7)$$

and combining with (6) again then gives 

$$\theta_{iz} = \frac{\sum_j A_{ij} q_{ij}(z)}{\sqrt{\sum_{ij} A_{ij} q_{ij}(z)}}. \qquad (8)$$

Maximizing the log-likelihood is now simply a matter of 
simultaneously solving Eqs. (5) and (8), which can be 
done iteratively by choosing a random set of initial val- 
ues and alternating back and forth between the two equa- 
tions. This type of approach is known as an expectation- 
maximization or EM algorithm and it can be proved that 
the log-likelihood increases monotonically under the it- 
eration, though it does not necessarily converge to the 
global maximum. To guard against the possibility of 
getting stuck in a local maximum, we repeat the entire 
calculation a number of times with random initial con- 
ditions and choose the result that gives the highest final 
log-likelihood. 
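
The iteration just described can be sketched directly, storing one probability vector per edge. This is an illustrative implementation under stated assumptions rather than the authors' code: the graph is given as a list of undirected edges, each listed once, and the names `e_step`, `m_step`, `log_likelihood`, and `fit`, the initialization range, and the default iteration and restart counts are all hypothetical choices.

```python
import math
import random

def e_step(edges, theta):
    """Edge color probabilities q_ij(z), proportional to theta_iz * theta_jz."""
    q = {}
    for i, j in edges:
        weights = [a * b for a, b in zip(theta[i], theta[j])]
        norm = sum(weights)
        q[(i, j)] = [w / norm for w in weights]
    return q

def m_step(edges, q, n, K):
    """New theta_iz = sum_j A_ij q_ij(z) / sqrt(sum_ij A_ij q_ij(z))."""
    k = [[0.0] * K for _ in range(n)]
    for i, j in edges:
        for z in range(K):
            k[i][z] += q[(i, j)][z]  # each edge contributes to both of its ends
            k[j][z] += q[(i, j)][z]
    totals = [sum(row[z] for row in k) for z in range(K)]
    return [[k[i][z] / math.sqrt(totals[z]) if totals[z] > 0 else 0.0
             for z in range(K)] for i in range(n)]

def log_likelihood(edges, theta):
    """Log-likelihood up to constants; each undirected edge appears twice in the sum over i, j."""
    K = len(theta[0])
    fit_term = sum(2.0 * math.log(sum(a * b for a, b in zip(theta[i], theta[j])))
                   for i, j in edges)
    penalty = sum(sum(row[z] for row in theta) ** 2 for z in range(K))
    return fit_term - penalty

def fit(edges, n, K, iterations=100, restarts=5, seed=0):
    """Run EM from several random starts and keep the run with the highest final log-likelihood."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        theta = [[rng.random() + 0.1 for _ in range(K)] for _ in range(n)]
        history = [log_likelihood(edges, theta)]
        for _ in range(iterations):
            theta = m_step(edges, e_step(edges, theta), n, K)
            history.append(log_likelihood(edges, theta))
        if best is None or history[-1] > best[1][-1]:
            best = (theta, history)
    return best

# Two triangles sharing vertex 2 (a small hypothetical test graph).
bowtie = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
theta, history = fit(bowtie, n=5, K=2)
```

The recorded `history` should never decrease from one iteration to the next, up to rounding, which is a convenient sanity check when adapting the sketch.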

The value of $q_{ij}(z)$ in Eq. (5) has a simple physical 
interpretation: it is the probability that an edge between 
i and j has color z, which is precisely the quantity we 
need in order to infer link communities in the network. 
Notice that $q_{ij}(z)$ is symmetric in i and j, as it should be for 
an undirected network. 

The calculation presented here is mathematically 
closely related to methods developed in the machine 
learning community for the analysis of text documents. 
Specifically, the model we fit can be regarded as a variant 
of a model used in probabilistic latent semantic analysis 
(PLSA) — a technique for automated detection of topics 
in a corpus of text — adapted to the present context of 
link communities. Connections between text analysis and 
community detection have been explored by several previous 
authors. Of particular interest is the work of Psorakis 
et al. [22], which, though it does not focus on link 
communities, uses another variant of the PLSA model, 
coupling it with an iterative fitting algorithm called nonnegative 
matrix factorization to find overlapping communities 
in directed networks. Also of note is the work 
of Parkinnen et al. [11], who consider link communities 
as we do, but take a contrasting algorithmic approach 
based on a Bayesian generative model and Markov chain 
Monte Carlo techniques. A detailed description of the 
interesting connections between text processing and network 
analysis would take us some way from the primary 
purpose of this paper, but for the interested reader we 
give a discussion and references in Appendix A. 



IV. IMPLEMENTATION 

The method outlined above can be implemented di- 
rectly as a computer algorithm for finding overlapping 
communities, and works well for networks of moderate 
size, up to tens of thousands of vertices. For larger net- 
works both memory usage and run-time become substan- 
tial and prevent the application of the method to the 
largest systems, but both can be improved by using a 
more sophisticated implementation which makes applications 
to networks of millions of vertices possible. 

The algorithm's memory use is determined by the 
space required to store the parameters: the $\theta_{iz}$ require 
O(nK) space while the $q_{ij}(z)$ require O(mK), where n 
and m are the numbers of vertices and edges in the network. 
Since m is usually substantially larger than n, this 
means that memory use is dominated by the $q_{ij}(z)$. We 
can reduce memory use by reorganizing the algorithm 
in such a way that the $q_{ij}(z)$ are never stored. Rather 
than focusing on the $\theta_{iz}$, we work instead with the average 
number $k_{iz}$ of ends of edges of color z connected to 
vertex i: 

$$k_{iz} = \sum_j A_{ij} q_{ij}(z). \qquad (9)$$

Given the values of these quantities on a given iteration 
of the algorithm, the calculation of the values at the next 
iteration is then as follows. First we define a new set of 
quantities $k'_{iz}$ that will store the new values of the $k_{iz}$. 
Initially we set all of them to zero. We also calculate the 
average number $\kappa_z$ of ends of edges of color z summed over all 
vertices 

$$\kappa_z = \sum_i k_{iz}, \qquad (10)$$

in terms of which the original $\theta_{iz}$ parameters are 

$$\theta_{iz} = \frac{k_{iz}}{\sqrt{\kappa_z}}. \qquad (11)$$

Next we go through each edge (i, j) in the network in 
turn and calculate the denominator of Eq. (5) for that i 
and j from the values of the $k_{iz}$ thus: 

$$D = \sum_z \theta_{iz}\theta_{jz} = \sum_z \frac{k_{iz} k_{jz}}{\kappa_z}. \qquad (12)$$

Armed with this value we can calculate the value of $q_{ij}(z)$ 
for this i, j and all z from Eq. (5): 

$$q_{ij}(z) = \frac{\theta_{iz}\theta_{jz}}{D} = \frac{k_{iz} k_{jz}}{D \kappa_z}. \qquad (13)$$

Now we add this value onto the quantities $k'_{iz}$ and $k'_{jz}$, 
discard the values of D and $q_{ij}(z)$, and repeat for the 
next edge in the network. When we have gone through 
all edges in this manner, the quantities $k'_{iz}$ will be equal 
to the sum in Eq. (9) and hence will be the correct new 
values of $k_{iz}$. 

This method requires us to store only the old and new 
values of $k_{iz}$, for a total of 2nK quantities, and not the 
values of $q_{ij}(z)$, which results in substantial memory savings 
for larger networks. 
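
The memory-saving reorganization can be sketched as a single function that maps the old edge-end counts to the new ones while holding only one edge's color probabilities at a time. The name `update_k`, the edge-list input (each undirected edge listed once), and the test values are assumptions made for illustration.

```python
def update_k(edges, k_old):
    """One EM iteration on the edge-end counts, never storing the per-edge probabilities."""
    n, K = len(k_old), len(k_old[0])
    kappa = [sum(k_old[i][z] for i in range(n)) for z in range(K)]  # total per color
    k_new = [[0.0] * K for _ in range(n)]                            # the primed quantities
    for i, j in edges:
        terms = [k_old[i][z] * k_old[j][z] / kappa[z] if kappa[z] > 0 else 0.0
                 for z in range(K)]
        D = sum(terms)               # denominator shared by all colors on this edge
        for z in range(K):
            q = terms[z] / D         # color probability, used once and then discarded
            k_new[i][z] += q
            k_new[j][z] += q
    return k_new

# Two triangles sharing vertex 2, with arbitrary positive starting values.
bowtie = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
k_start = [[1.2, 0.8], [1.0, 1.0], [2.5, 1.5], [0.7, 1.3], [0.9, 1.1]]
k_next = update_k(bowtie, k_start)
```

Since the color probabilities on each edge sum to one, every row of the result sums to the degree of the corresponding vertex, which gives a quick consistency check.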

As for the running time, the algorithm as we have described 
it has a computational complexity of O(mK) operations 
per iteration of the equations, where m is again 
the number of edges in the network. The primary determinant 
of the total run-time is the number of iterations 
that have to be performed before the values of the 
$k_{iz}$ converge. In practice, we find in many cases that a 
rather large number of iterations is required, which slows 
the performance of the method, but the speed can be 
improved as follows. 

In a typical application of the algorithm to a network, 
the end result is that each vertex belongs to only a subset 
of the K possible communities. To put that another 
way, we expect that a large fraction of the parameters $k_{iz}$ 
will tend to zero under the EM iteration. It is straightforward 
to see from the equations above that if a particular 
$k_{iz}$ ever becomes zero, then it must remain so for 
all future iterations, which means that it no longer need 
be updated and we can save ourselves time by excluding 
it from our calculations. This leads to two useful 
strategies for pruning our set of variables. In the first, 
we set to zero any $k_{iz}$ that falls below a predetermined 
threshold $\delta$. Once a $k_{iz}$ has been set to zero, the corresponding 
values of the $q_{ij}(z)$ on all the adjacent edges 
are also zero and therefore need not be calculated. Thus, 
for each edge, we need only calculate the values of $q_{ij}(z)$ 
for those colors z for which both $k_{iz}$ and $k_{jz}$ are nonzero, 
i.e., for the intersection of the sets of colors at vertices i 
and j. This strategy leads to significant speed increases 
when the number K of communities is large. For smaller 
values of K the speed savings are outweighed by the additional 
computational overhead and it is more efficient 
to simply calculate all $q_{ij}(z)$, but we nonetheless still set 
the values of the $k_{iz}$ to zero below the threshold $\delta$ because 
it makes possible our second pruning strategy. 

Our second strategy, which can be used in tandem with 
the first and gives significant speed improvements for all 
values of K, is motivated by the observation that if all 
but one of the $k_{iz}$ for a particular vertex are set to zero, 
then the color of the vertex is fixed at a single value and 
will no longer change at all. If both vertices at the ends 
of an edge (i, j) have this property, that is, if both of them have 
converged to a single color and are no longer changing, 
then the edge connecting them no longer has any effect 
on the calculation and can be deleted entirely from the 
network. 

By the use of these two strategies the speed of our calculations 
is improved markedly. We find in practice that 
the numbers of parameters $k_{iz}$ and edges both shrink 
rapidly and substantially with the progress of the calculation, 
so that the majority of the iterations involve only a 
subset, typically those associated with the vertices whose 
community identification is most ambiguous. If the value 
of the threshold $\delta$ is set to zero, then the pruned algorithm 
is exactly equivalent to the original EM algorithm 
and the results are identical, yet even with this choice we 
find substantial speed improvements. If $\delta$ is chosen small 
but nonzero (we use $\delta = 0.001$ in our calculations [23]), 
then we introduce an approximation into the calculation 
which means the results will be different in general from 
the original algorithm. In practice, however, the difference 
is small, and the nonzero $\delta$ gives us an additional 
significant speed improvement. 
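
One possible way to organize the two pruning strategies is sketched below. Several details are assumptions made for illustration and are not specified in the text: the bookkeeping array `base`, which remembers the fixed contribution of deleted edges so that the edge-end counts stay correct; the extra requirement that both endpoints of a deleted edge share the same single color; and the choice to skip an edge whose endpoints have no color in common, which can only happen when the threshold is nonzero.

```python
def prune(edges, k, base, delta=0.001):
    """Apply both pruning strategies in place and return (active_edges, colors)."""
    for row in k:
        for z, value in enumerate(row):
            if value < delta:
                row[z] = 0.0          # strategy 1: drop negligible parameters
    colors = [{z for z, value in enumerate(row) if value > 0.0} for row in k]
    active = []
    for i, j in edges:
        if len(colors[i]) == 1 and colors[i] == colors[j]:
            (z,) = colors[i]          # strategy 2: this edge's color is settled
            base[i][z] += 1.0         # remember its contribution to both ends
            base[j][z] += 1.0
        else:
            active.append((i, j))
    return active, colors

def pruned_update(active, k, base, colors):
    """One update of the edge-end counts using only surviving edges and shared colors."""
    K = len(k[0])
    kappa = [sum(row[z] for row in k) for z in range(K)]
    k_new = [row[:] for row in base]  # settled edges keep contributing probability one
    for i, j in active:
        terms = {z: k[i][z] * k[j][z] / kappa[z] for z in colors[i] & colors[j]}
        D = sum(terms.values())
        if D == 0.0:
            continue                  # assumption: skip edges with no shared color
        for z, term in terms.items():
            k_new[i][z] += term / D
            k_new[j][z] += term / D
    return k_new

# Two triangles joined by the edge (2, 3); illustrative values only.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
k = [[1.9995, 0.0005], [2.0, 0.0], [2.5, 0.5], [0.5, 2.5], [0.0, 2.0], [0.0, 2.0]]
base = [[0.0, 0.0] for _ in k]
active, colors = prune(edges, k, base)
k_new = pruned_update(active, k, base, colors)
```

In this example the edges (0, 1) and (4, 5) are removed from the working set, while the edges touching the ambiguous vertices 2 and 3 remain active.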

A detailed comparison of results and run-times for the 
pruned and original versions of the algorithm is given in 
Appendix B for a range of networks. Unless stated otherwise, 
all calculations presented in the remainder of the 
paper are done with the faster version of the algorithm. 



V. RESULTS 

We test the performance of the algorithm described 
above using both synthetic (computer-generated) net- 
works and a range of real-world examples. The synthetic 
networks allow us to test the algorithm's ability to detect 
known, planted community structure under controlled 
conditions, while the real networks allow us to observe 
performance under practical, real-world conditions. 



A. Synthetic networks 

Our synthetic network examples take the form of a 
classic consistency test. We generate networks using the 
same stochastic model that the algorithm itself is based 
on and measure the algorithm's ability to recover the 
known community divisions for various values of the pa- 
rameters. One can vary the values to create networks 
with stark community structure (which should make de- 
tection easy) or no community structure at all (which 
makes it impossible), and everything in between, and we 
can thereby vary the difficulty of the challenge we pose 
to the algorithm. 

The networks we use for our tests have n = 10000 
vertices each, divided into two overlapping communities. 
We place x vertices in the first community only, meaning 
they have connections only to others in that community, 
y vertices in the second community only, and the remaining 
$z = n - x - y$ vertices in both communities, with equal 
numbers of connections to vertices in either group on average. 
We fix the expected degree of all vertices to take 
the same value k. 
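
The text does not give explicit parameter values for these benchmark networks, so the following is only one choice consistent with the description: vertices in a single community put all of their expected degree into that color, and overlap vertices split it evenly between the two colors. The function names and the closed-form constants are our own derivation from the requirement that every vertex has expected degree k, and the formula assumes each community contains at least one vertex.

```python
import math

def benchmark_theta(x, y, z, k):
    """Parameters theta[i] = [color 0, color 1] giving every vertex expected degree k."""
    a1 = math.sqrt(k / (x + z / 2))   # propensity of a vertex in community 1 only
    a2 = math.sqrt(k / (y + z / 2))   # propensity of a vertex in community 2 only
    only_first = [[a1, 0.0] for _ in range(x)]
    only_second = [[0.0, a2] for _ in range(y)]
    overlap = [[a1 / 2, a2 / 2] for _ in range(z)]  # half the degree in each color
    return only_first + only_second + overlap

def expected_degrees(theta):
    """Expected degree of vertex i: sum over colors of theta_iz times the column sum."""
    K = len(theta[0])
    column = [sum(row[c] for row in theta) for c in range(K)]
    return [sum(row[c] * column[c] for c in range(K)) for row in theta]

theta = benchmark_theta(4750, 4750, 500, 10)
degrees = expected_degrees(theta)
```

Varying `k`, the split between `x` and `y`, or the overlap size `z` in the call above corresponds to the three kinds of test described in the text.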

We perform three sets of tests. In the first we fix the 
size of the overlap between the communities at z = 500, 
divide the remaining vertices evenly, x = y = 4750, and 
observe the behavior of the algorithm as we vary the value 
of k. When $k \to 0$ there are no edges in the network 
and hence no community structure, and we expect the 
algorithm (or any algorithm) to fail. When k is large, on 
the other hand, it should be straightforward to work out 
where the communities are. 

For our second set of tests we again set the overlap at 
z = 500 but this time we fix k = 10 and vary the balance 
of vertices between x and y. Finally, for our third set of 
tests we set k = 10 and constrain x and y to be equal, 
but allow the size z of the overlap to vary. 

FIG. 1: Results from the three sets of synthetic tests described in the text. Each data point is averaged over 100 networks. 
Twenty random initializations of the variables were used for each network and the run giving the highest value of the log- 
likelihood was taken as the final result. In each panel the black curve shows the fraction of vertices assigned to the correct 
communities by the algorithm, while the red curve is the Jaccard index for the vertices in the overlap. 

In Fig. 1 we show the measured fraction of vertices 
classified correctly (black curve) in each of these three 
sets of tests (the three separate panels), averaged over 
100 networks for each point. To be considered correctly 
classified a vertex's membership (or lack of membership) 
in both groups must be reported correctly by the algorithm, 
and the algorithm considers any vertex to be a 
member of a group if, on average, it has at least one edge 
of the appropriate color when the maximum likelihood 
fitting procedure is complete. In mathematical terms, a 
vertex belongs to community z if its expected degree with 
respect to color z, given by $\sum_j A_{ij} q_{ij}(z)$, is greater than 
one. 
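
The membership rule and the accuracy measure described above can be written in a few lines. This is a sketch in which the expected color degrees are assumed to be available as rows `k[i][z]`; the helper names and the sample values are hypothetical.

```python
def memberships(k, threshold=1.0):
    """Set of communities for each vertex: colors whose expected degree exceeds the threshold."""
    return [{z for z, value in enumerate(row) if value > threshold} for row in k]

def fraction_correct(found, truth):
    """Fraction of vertices whose full set of memberships matches the planted one."""
    return sum(f == t for f, t in zip(found, truth)) / len(truth)

k = [[3.0, 0.2], [1.5, 1.5], [0.4, 2.6]]       # illustrative expected color degrees
found = memberships(k)                          # [{0}, {0, 1}, {1}]
score = fraction_correct(found, [{0}, {0, 1}, {0, 1}])
```

A vertex counts as correct only if every one of its memberships is right, so the third vertex in this example, which the rule places in one community when the planted truth has it in two, is scored as an error.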

As the figure shows, there are substantial parameter 
ranges for all three tests for which the algorithm per- 
forms well, correctly classifying most of the vertices in 
the network. As expected the accuracy in the first test 
increases with increasing k and for values of k greater 
than about ten — a figure easily attained by many real- 
world networks — the algorithm identifies the known com- 
munity structure essentially perfectly. In the other two 
tests accuracy declines as either the asymmetry of the 
two groups or the size of the overlap increases, but ap- 
proaches 100% when either is small. 

To probe in more detail the algorithm's ability to iden- 
tify overlapping communities, we have also measured, for 
the same test networks, a Jaccard index: if S is the set 
of vertices in the true overlap and V is the set the algorithm 
identifies as being in the overlap, then the Jaccard 
index is $J = |S \cap V| / |S \cup V|$. This index is a standard 
measure of similarity between sets that rewards accurate 
identification of the overlap while penalizing both false 
positives and false negatives. The values of the index are 
shown as the red curves in Fig. 1 and, as we can see, the 
general shape of the curves is similar to the overall frac- 
tion of correctly identified vertices. In particular, we note 
that for networks with sufficiently high average degree k 
the value of J tends to 1, implying that the overlap is 
identified essentially perfectly. 
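
For reference, the Jaccard index of the true and inferred overlap sets can be computed as below. Returning 1 when both sets are empty is a convention chosen here for the sketch, not something stated in the text.

```python
def jaccard(true_overlap, found_overlap):
    """Size of the intersection divided by size of the union of the two vertex sets."""
    s, v = set(true_overlap), set(found_overlap)
    union = s | v
    return len(s & v) / len(union) if union else 1.0  # assumed convention for empty sets

partial = jaccard({1, 2, 3}, {2, 3, 4})   # two shared vertices out of four in total
perfect = jaccard({1, 2}, {1, 2})
disjoint = jaccard({1}, {2})
```

Both a missed overlap vertex and a spurious one enlarge the union without enlarging the intersection, which is how the index penalizes false negatives and false positives alike.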



B. Real networks 

We have also tested our method on numerous real- 
world networks. In this section we give detailed results 
for three specific examples. Summary results for a number 
of additional examples are given in Appendix B. 

Our first example is one that has become virtually 
obligatory in tests of community detection, Zachary's 
"karate club" network, which represents friendship patterns 
between members of a university sports club, deduced 
from an observational study [24]. The network is 
interesting because the club split in two during the study, 
as a result of an internal dispute, and it has been found 
repeatedly that one can deduce the lines of the split from 
a knowledge of the network structure alone. 

Figure 2a shows the decomposition of the karate club 
network into two overlapping groups as found by our al- 
gorithm. The colors in the figure show both the division 
of the vertices and the division of the edges. The split 
between the two groups in the club is clearly evident in 
the results and corresponds well with the acknowledged 
"ground truth," but in addition the algorithm assigns 
several vertices to both groups. The individuals repre- 
sented by these overlap vertices, being by definition those 
who have friends in both camps, might be supposed to 
have had some difficulty deciding which side of the dis- 
pute to come down on, and indeed Zachary's original 
discussion of the split includes some indications that this 
was the case. Note also that, in addition to identify- 
ing overlapping vertices, our method can assign to each 
a fraction by which it belongs to one community or the 
other, represented in the figure by the pie-chart coloring 
of the vertices in the overlap. The fraction is calculated 
as the expected fraction of edges of each color incident 
on the vertex. 
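
The fractional membership can be sketched as a normalization of a vertex's expected color degrees. The function name and input values are illustrative, and treating a vertex with no edges as having all-zero fractions is an assumption.

```python
def membership_fractions(color_degrees):
    """Expected fraction of a vertex's edges that carry each color."""
    total = sum(color_degrees)
    if total == 0:
        return [0.0] * len(color_degrees)  # assumed convention for isolated vertices
    return [value / total for value in color_degrees]

fractions = membership_fractions([3.0, 1.0])  # a vertex with four expected edge ends
```

Here three quarters of the vertex's edges are expected to belong to the first community, which is the proportion a pie-chart rendering of the vertex would show.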

Our second example is another social network and 
again one whose community structure has been studied 
previously. This network, compiled by Knuth [25], represents
the patterns of interactions between the fictional
characters in the novel Les Miserables by Victor Hugo.
In this network two characters are connected by an edge
if they appear in the same chapter of the book. Figure 2b
shows our algorithm's partition of the network
into six overlapping communities and the partition ac- 
cords roughly with social divisions and subplots in the 
plot-line of the novel. But what is particularly interesting 
in this case is the role played by the hubs in the network — 
the major characters who are represented by vertices of 
especially high degree. It is common to find high-degree 
hubs in networks of many kinds, vertices with so many 
connections that they have links to every part of the 
network, and their presence causes problems for tradi- 
tional, nonoverlapping community detection schemes be- 
cause they do not fit comfortably in any community: no 
matter where we place a hub it is going to have many con- 
nections to vertices in other communities. Overlapping 
communities provide an elegant solution to this problem 
because we can place the hubs in the overlaps. As Fig. 2b
shows, our algorithm does exactly this, placing many of 
the hubs in the network in two or more communities. 
Such an assignment is in this case also realistic in terms 
of the plot of the novel: the major characters represented 
by the hubs are precisely those that appear in more than 
one of the book's subplots. 

A similar behavior can be seen in our third exam- 
ple, which is a transportation network, the network of 
passenger airline flights between airports in the United 
States. In this network, based on data for flights in 
2004, the vertices represent airports and an edge be- 
tween airports indicates a regular scheduled direct flight. 
Spatial networks, those in which, as here, the vertices 
have well-defined positions in geographic space, are of- 
ten found to have higher probability of connection for 
vertex pairs located closer together [26, 27], which suggests
that communities, if they exist, should be regional,
consisting principally of blocks of nearby vertices. The 
communities detected by our algorithm in the airline network
follow this pattern, as shown in Fig. 3. The three-way
split shown divides the network into east and west
coast groups and a group for Alaska. The overlaps are 
composed partly of vertices that lie along the geographic 
boundaries between the groups, but again include hubs 
as well, which tend to be placed in the overlaps even 
when they don't lie on boundaries. As with the previous 
example, this placement gives the algorithm a solution 
to the otherwise difficult problem of assigning to any one 
group a hub that has connections to all parts of the net- 
work, but it also makes intuitive sense. Hubs are the 
"brokers" of the airline network, the vertices that con- 
nect different communities together, since they are pre- 
cisely the airports that one passes through in traveling 
between distant locations. Thus it is appropriate that 
they be considered members of more than one group. In 
most cases the hubs belong most strongly to the commu- 
nity in which they are geographically located, and less 
strongly to other communities. 




FIG. 2: Overlapping communities in (a) the karate club network
of [24] and (b) the network of characters from Les
Miserables [25], as calculated using the algorithm described
in this paper. The edge colors correspond to the highest value
of $q_{ij}(z)$ for the given edge, while vertex colors indicate the
fraction of incident edges that fall in each community. For 
vertices in more than one community the vertices are drawn 
larger for clarity and divided into pie charts representing their 
division among communities. 



VI. NONOVERLAPPING COMMUNITIES 

As we have described it, our algorithm is an algorithm
for finding overlapping communities in networks,
but it can be used to find nonoverlapping communities
as well. As pointed out by a number of previous authors
[13, 28, 29], any algorithm that calculates proportional
membership of vertices in communities can be
adapted to the nonoverlapping case by assigning each
vertex to the single community to which it belongs most
strongly. In our case, this means assigning vertices to
the community for which the value of $\theta_{iz}$ is largest. It
turns out that this procedure can be justified rigorously
in our case by regarding the link community model as a
relaxation of a nonoverlapping degree-corrected stochastic
blockmodel. The details are given in Appendix C.
Here we give some example applications to show how the
approach works in practice.

FIG. 3: Overlapping communities in the network of US passenger air transportation. The three communities produced by the
calculation correspond roughly to the east and west coasts of the country and Alaska.

As with the overlapping case, we test the method on 
both synthetic and real-world networks. For the syn- 
thetic case we use a standard test, the LFR benchmark 
for unweighted undirected networks with planted community
structure [30, 31]. To make comparisons simple
we use the same parameters as in Ref. [31] with networks
of 1000 and 5000 vertices, average degree 20, maximum
degree 50, degree exponent $-2$, and community exponent
$-1$. We also use the same two ranges of community
sizes, with communities of 10 to 50 vertices for one set 
of tests (labeled S for "small" in our figures) and 20 to 
100 vertices for the other set (labeled B for "big"). The 
value of K for the detection algorithm was set equal to 
the number of communities in the benchmark network 
(which, because of the nature of the benchmark, is not a 
constant but varies from one network to another).

To quantify our algorithm's success at detecting the 
known communities in the benchmark networks we use 
the variant normalized mutual information measure proposed
in [31]. We note that this measure is different, and
in general returns different results, from the traditional
normalized mutual information used to evaluate community
structure, but using it allows us to make direct
comparisons with the results for other algorithms given
in [31].

In our benchmark tests we find that the simplistic 
rounding method described above for nonoverlapping 
communities, just choosing the community with the highest
value of $\theta_{iz}$, returns only average performance when
compared to the other algorithms tested in Ref. [31].
However, a simple modification of the algorithm produces 
significantly better results: after generating a candidate 
division into communities using the rounding method, 
we then apply a further optimization step in which we 
move from one community to another the single vertex 
that most increases the log-likelihood of the division un- 
der the stochastic blockmodel, and repeat this exercise 
until no further such moves exist. This process, which is 
easy to implement and carries little computational cost 
when compared to the calculation of the initial division, 
improves our results dramatically. 
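The post-processing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we take the log-likelihood of a division to be the profile log-likelihood of the degree-corrected stochastic blockmodel, $\sum_{rs} m_{rs} \log [m_{rs} / (\kappa_r \kappa_s)]$, where $m_{rs}$ counts edge ends between groups r and s and $\kappa_r = \sum_s m_{rs}$; the text does not spell out the exact objective used. The function names are hypothetical, and the log-likelihood is recomputed from scratch for every trial move, which is simple but far slower than an incremental update.

```python
import math


def dcsbm_log_likelihood(edges, g, K):
    """Profile log-likelihood sum_rs m_rs * log(m_rs / (kappa_r * kappa_s))."""
    m = [[0] * K for _ in range(K)]
    for i, j in edges:
        m[g[i]][g[j]] += 1
        m[g[j]][g[i]] += 1  # count each undirected edge in both directions
    kappa = [sum(row) for row in m]
    total = 0.0
    for r in range(K):
        for s in range(K):
            if m[r][s] > 0:
                total += m[r][s] * math.log(m[r][s] / (kappa[r] * kappa[s]))
    return total


def greedy_refine(edges, g, K):
    """Repeatedly apply the single vertex move with the largest gain."""
    g = list(g)
    current = dcsbm_log_likelihood(edges, g, K)
    while True:
        best_gain, best_move = 1e-12, None
        for v in range(len(g)):
            old = g[v]
            for r in range(K):
                if r == old:
                    continue
                g[v] = r  # trial move of vertex v to community r
                gain = dcsbm_log_likelihood(edges, g, K) - current
                if gain > best_gain:
                    best_gain, best_move = gain, (v, r)
            g[v] = old  # undo the trial moves for this vertex
        if best_move is None:
            return g  # no move increases the log-likelihood
        v, r = best_move
        g[v] = r
        current += best_gain


# Two triangles joined by the edge (2, 3); vertex 3 starts in the wrong group.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
refined = greedy_refine(edges, [0, 0, 0, 0, 1, 1], K=2)
```

In the toy example the first accepted move transfers vertex 3 to the second group, after which no single move raises the log-likelihood and the loop stops.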

The results of our tests are shown in Figure 4. The
top panel shows the performance of the algorithm without
the additional optimization step and the results fall
in the middle of the pack when compared to previous
algorithms, better than some methods but not as good as
others. The bottom panel shows the results with the
additional optimization step added, and now the algorithm
performs about as well as, or better than, the algorithms
analyzed in Ref. [31]. The general shape of the mutual
information curve is similar to that of the best competing
methods, falling off around the same place, although
the mutual information values are somewhat lower for
low values of the mixing parameter, indicating that the
method is not getting the community structure exactly
correct in this regime. Examining the communities in
detail reveals that the method occasionally splits or merges
communities. It is possible that performance could be
improved further by a less simple-minded post-processing
step for optimizing the likelihood.

FIG. 4: Results of tests of the nonoverlapping community
algorithm described in the text applied to synthetic networks
generated using the LFR benchmark model of Lancichinetti
et al. [31]. Parameters used were the same as in
Ref. [31] and (S) and (B) denote the "small" and "big"
community sizes used by the same authors. The top and bottom
panels respectively show the results without and with
post-processing to optimize the value of the log-likelihood. Ten
random initializations of the variables were used for each
network and each point is an average over 100 networks.

FIG. 5: Non-overlapping communities found in the US college
football network of [1]. The clusters of vertices represent
the communities found by the algorithm, while the vertex
colors represent the "conferences" into which the colleges are
formally divided. As we can see, the algorithm in this case
extracts the known conference structure perfectly. (The dark
purple vertices represent independent colleges that belong to
no conference.)

We also give, in Fig. 5, an example of a test of the
method against a real-world network, in this case the
much studied college football network of Ref. [1]. In
this network the vertices represent university teams in 
American football and the edges represent the schedule 
of games for the year 2000 football season, two teams be- 
ing connected if they played a game. It has been found in 
repeated analyses that a clustering of this network into 
communities can retrieve the organizational units of US 
college sports, called "conferences," into which universi- 
ties are divided for the purposes of competition. In 2000 
there were 11 conferences among the Division I-A teams
that make up the network, as well as 8 teams independent
of any conference. As Fig. 5 shows, every single
team that belongs in a conference is placed correctly by 
our algorithm. 



VII. CONCLUSION 

In this paper we have described a method for detect- 
ing communities, either overlapping or not, in undirected 
networks. The method has a rigorous mathematical foundation,
being based on a probabilistic model of link communities;
is easy to implement; is fast enough for networks
of millions of vertices; and gives results competitive with
other algorithms. 

Nonetheless, the method is not perfect. Its main cur- 
rent drawback is that it offers no criterion for determin- 
ing the value of the parameter we call K, the number 
of communities in a network. This is a perennial prob- 
lem for community detection methods of all kinds. Some 
methods, such as modularity maximization, do offer a so- 
lution to the problem, but that solution is known to give 
biased answers or be inconsistent under at least some 
circumstances [H, More rigorous approaches such 
as the Bayesian information criterion and the Akaike in- 
formation criterion are unfortunately not applicable here 
because many of the model parameters are zero, putting 
them on the boundary of the parameter space, which in- 
validates the assumptions made in deriving these criteria. 

Another approach is to perform the calculations with 
a large value of K and regularize the parameters in a 
manner such that some communities disappear, meaning 
that zero edges are associated with those communities. 
For example, Psorakis et al. [22], in studies using their
matrix factorization algorithm, used priors that penalized
their model for including too many nonzero parameter
values and hence created a balance between numbers
of communities and goodness of fit to the network
data. Unfortunately, the priors themselves contain un- 
determined parameters whose values can influence the 
number of communities and hence the problem is not 
completely solved by this approach. 

We believe that statistical model selection methods ap- 
plied to generative models should in principle be able to 
find the number of communities in a consistent and sat- 
isfactory manner. We have performed some initial ex- 
periments with such methods, and the quality of the re- 
sults seems promising, but the methods are at present 
too computationally demanding to be applied to any but 
the smallest of networks. It is an open problem whether 
a reliable method can be developed that runs in reason- 
able time on the large networks of interest to today's 
scientists.



Appendix A: Community detection and statistical
text analysis

As mentioned in the main text, the generative model 
we use is the network equivalent of a model used in 
text analysis called probabilistic latent semantic analysis
(PLSA) [32–34], modified somewhat for the particular
problem we are addressing. In this appendix, we describe
PLSA and related methods and models and their
relationship to the community detection problem. 

A classic problem in text analysis, which is addressed 
by the PLSA method, is that of analyzing a "corpus" of 
text documents to find sets of words that all (or mostly) 
occur in the same documents. The assumption is that 
these sets of words correspond to topics or themes that 
can be used to group documents according to content. 
The PLSA approach regards documents as a so-called 
"bag of words," meaning one considers only the number 
of times each word occurs in a document and not the 
order in which words occur. (Also, one often considers 
only a subset of words of interest, rather than all words 
that appear in the corpus.) 

Mathematically a corpus of D documents and W words 
of interest is represented by a matrix A having elements 
$A_{wd}$ equal to the number of times word w appears in
document d. To make the connection to networks, this 
matrix can be thought of as the incidence matrix of a 
weighted bipartite network having one set of vertices for
the documents, one for the words, and edges connecting
words to the documents in which they appear with weight 
equal to their frequency of occurrence. 
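To make the bag-of-words representation concrete, the following Python sketch builds the matrix $A_{wd}$ for a tiny made-up corpus and lists the weighted edges of the corresponding bipartite network. The documents, the vocabulary of "words of interest", and the function name `word_document_matrix` are all hypothetical, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def word_document_matrix(documents, vocabulary):
    """Return A with A[w][d] = number of times word w occurs in document d."""
    index = {word: w for w, word in enumerate(vocabulary)}
    A = [[0] * len(documents) for _ in vocabulary]
    for d, text in enumerate(documents):
        # Word order is discarded: only the counts are kept.
        for word, count in Counter(text.lower().split()).items():
            if word in index:  # words outside the vocabulary are ignored
                A[index[word]][d] = count
    return A


documents = ["the network has many edges", "edges connect vertices edges"]
vocabulary = ["network", "edges", "vertices"]
A = word_document_matrix(documents, vocabulary)

# Weighted bipartite edge list: (word, document, weight) for nonzero entries.
bipartite_edges = [(vocabulary[w], d, A[w][d])
                   for w in range(len(vocabulary))
                   for d in range(len(documents)) if A[w][d] > 0]
```

Here the matrix has one row per word and one column per document, so it is rectangular rather than square, in line with its role as an incidence matrix.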

In PLSA each word-document pair — an edge in the cor- 
responding network picture — is associated with an unob- 
served variable z which denotes one of K topical groups. 
Each edge is assumed to be placed independently at ran- 
dom in the bipartite graph, with the probability that 
an edge falls between word w and document d being 
broken down in the form $\sum_z P(w|z) P(d|z) P(z)$, where
$P(z)$ is the probability that the edge belongs to topic z,
$P(w|z)$ is the probability that an edge with topic z connects
to word w, and $P(d|z)$ is the probability that an
edge with topic z connects to document d. Note that, 
given the topic, the document and word ends of each 
edge are placed independently. (Hofmann [32] calls this
parametrization a "symmetric" one, meaning that the 
word and the document play equivalent roles mathemat- 
ically, but in the networks jargon this would not be con- 
sidered a symmetric formulation — the network is bipar- 
tite and the incidence matrix is not symmetric, nor even, 
in general, square.) 

An alternative description of the model, which is useful
for actually generating the incidence matrix and which
corresponds with our formulation of the equivalent network
problem, is that each matrix element $A_{wd}$ takes a
random value drawn independently from a Poisson distribution
with mean $\sum_z P(w|z) P(d|z)\, \omega_z$. In the language
of networks, each edge is placed with independent probability
$\sum_z P(w|z) P(d|z) P(z)$, where $P(z) = \omega_z / \sum_{z'} \omega_{z'}$.
In our work, where we focus on one-mode networks and
a symmetric adjacency matrix instead of an incidence
matrix, the parameter $\omega_z$ is redundant and we omit it.
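The Poisson description lends itself to a direct simulation. The sketch below draws each matrix element from a Poisson distribution whose mean is the sum over topics of $\omega_z P(w|z) P(d|z)$. The parameter values are invented for illustration, the function names are hypothetical, and the simple Poisson sampler (Knuth's multiplication method) is our own choice for keeping the example free of external libraries; it is only suitable for small means.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_counts(p_w_given_z, p_d_given_z, omega, seed=0):
    """Sample A[w][d] ~ Poisson(sum_z omega[z] * P(w|z) * P(d|z))."""
    rng = random.Random(seed)
    K = len(omega)
    W, D = len(p_w_given_z[0]), len(p_d_given_z[0])
    A = [[0] * D for _ in range(W)]
    for w in range(W):
        for d in range(D):
            mean = sum(omega[z] * p_w_given_z[z][w] * p_d_given_z[z][d]
                       for z in range(K))
            A[w][d] = poisson(mean, rng)
    return A


# Two topics with disjoint words and documents (hypothetical values).
p_w_given_z = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
p_d_given_z = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
omega = [40.0, 40.0]
A = generate_counts(p_w_given_z, p_d_given_z, omega)
```

Because the two topics share no words or documents in this example, entries linking a word of one topic to a document of the other have mean zero and are never generated.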

PLSA involves using the edge probability above to cal- 
culate a likelihood for the entire word-document distri- 
bution, then maximizing with respect to the unknown 
probabilities $P(w|z)$, $P(d|z)$, and $P(z)$. The resulting
probabilities give one a measure of how strongly each 
word or document is associated with a particular topic z, 
but since the topics are arbitrary, this is effectively the 
same as simply grouping the words and documents into 
"communities." Alternatively, one can use the probabil- 
ities to divide the edges of the bipartite graph among 
the topical groups, giving the text equivalent of the "link 
communities" that are the focus of our calculations. 

A number of methods have been explored for maximiz- 
ing the likelihood. Mathematically the one most closely 
related to our approach is the expectation-maximization 
(EM) algorithm of Hofmann [32–34], though the correspondence
is not exact. Hofmann's work focuses solely
on text processing — the connection to networks was not 
made until later — and the method cannot be translated 
directly for applications to standard undirected one- 
mode networks. It is possible to generalize the method 
to directed one-mode networks in a fairly straightforward 
fashion, and one might imagine that the undirected case 
could then be treated as a directed network with directed
edges running in both directions between every
connected pair of vertices. Unfortunately, this approach
does not work, since in general it results in a posterior
edge probability $q_{ij}(z)$ which is asymmetric in i and j.
This asymmetry causes problems when we want to associate
a community with an undirected edge. If $q_{ij} \neq q_{ji}$,
then the edge may be in one community when considered 
as an edge from i to j and a different community when 
considered as an edge from j to i. 

Rather than applying the standard PLSA model to 
network problems, therefore, a better approach is to use 
a model that is intrinsically symmetric from the out- 
set, and this leads us to the formulation in this paper. 
This symmetric formulation and the corresponding EM 
algorithm have not, to our knowledge, been used pre- 
viously for community detection in networks, but sev- 
eral other related approaches have, including ones based 
on the techniques known as nonnegative matrix factorization
(NMF) [35, 36] and latent Dirichlet allocation
(LDA) [33, 37]. These formulations have similar goals
to ours, but are typically asymmetric (and hence unsuit- 
able for undirected networks) and use different algorith- 
mic approaches for maximizing the likelihood. The NMF 
formulation is similar in style to an EM algorithm, us- 
ing an iterative maximization scheme, but the specific 
iteration equations are different. Several papers have recently
proposed using NMF to find overlapping communities
[22, 28, 29], and the work of Psorakis et al. [22]
in particular uses NMF with the PLSA model, although 
again in an asymmetric formulation, and not applied to 
link communities. 

Recent work by Parkinnen et al. [11] and Gyenge et al.
[1^ does consider link communities, in an asymmetric 
formulation, but uses algorithmic approaches that are 
different again. Parkinnen et al. [11] use a model that attaches
conjugate priors to the parameters and then samples
the posterior distribution of link communities with
a collapsed Gibbs sampler. 

LDA [33, 37] offers an alternative but related approach
that also attaches priors to the parameters, but in a spe- 
cific way that relies on the asymmetric formulation of 
the model. In [s^ and [i^l, LDA is adapted to net- 
works by treating vertex-edge pairs as analogous to word- 
document pairs and then associating communities with 
the vertex-edge pairs. This is an interesting approach but 
differs substantially from the others discussed here, in- 
cluding our own, in which vertex- vertex pairs (i.e., edges) 
are the quantity analogous to word-document pairs. 

Finally, in Appendix C we show that our model can be
used to find nonoverlapping communities by viewing it as 
a relaxation of a nonoverlapping stochastic blockmodel. 
A corresponding relaxation has been noted previously for 
a version of NMF and was shown to be related to spectral 
clustering [41, 42].



Appendix B: Results for running time 

As discussed in Section IV, a naive implementation
of the EM equations gives an algorithm that is only 
moderately fast — not fast enough for very large net- 
works. We described a more sophisticated implemen- 
tation that prunes unneeded variables from the iteration 
and achieves significantly greater speed. In this appendix 
we give a comparison of the performance of the two ver- 
sions of the algorithm on a set of test networks. 

The results are summarized in Table I, which gives the
CPU time in seconds taken to complete the overlapping
community detection calculation on a standard desktop
computer (circa 2011) for each of the test networks. In
these tests we use 100 random initializations of the vari- 
ables and take as our final result the run that gives the 
highest value of the log-likelihood. For each network we 
give the results of three different calculations: (1) the cal- 
culation performed using the naive EM algorithm; (2) the 
calculation using the pruned algorithm with the threshold
parameter $\delta$ set to zero, meaning the algorithm gives
results identical to the naive algorithm except for numerical
rounding; and (3) the calculation performed using
the pruned algorithm with $\delta = 0.001$, which introduces
an additional approximation that typically results in a 
slightly poorer final value of the log-likelihood, but gives 
a significant additional speed boost. 
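The restart protocol can be sketched as follows for the naive (unpruned) algorithm only. The sketch assumes update equations of the form $q_{ij}(z) = \theta_{iz}\theta_{jz} / \sum_z \theta_{iz}\theta_{jz}$ and $\theta_{iz} = \sum_j A_{ij} q_{ij}(z) / \sqrt{\sum_{ij} A_{ij} q_{ij}(z)}$, which is our reading of the main text and is not reproduced in this appendix. It also uses a log-likelihood defined only up to constants, a fixed iteration count in place of a convergence test, and hypothetical function names. The pruning scheme and the threshold $\delta$ are not implemented.

```python
import math
import random


def log_likelihood(edges, theta, K):
    """Poisson log-likelihood up to constants, counting each edge once."""
    fit = 0.0
    for i, j in edges:
        mu = sum(theta[i][z] * theta[j][z] for z in range(K))
        if mu <= 0.0:
            return float("-inf")
        fit += math.log(mu)
    penalty = 0.5 * sum(sum(row[z] for row in theta) ** 2 for z in range(K))
    return fit - penalty


def em_run(edges, n, K, rng, iterations=200):
    """One EM run from a random start; returns (log_likelihood, theta)."""
    theta = [[rng.random() for _ in range(K)] for _ in range(n)]
    for _ in range(iterations):
        k = [[0.0] * K for _ in range(n)]  # k[i][z] = sum_j A_ij q_ij(z)
        for i, j in edges:
            w = [theta[i][z] * theta[j][z] for z in range(K)]
            total = sum(w)
            if total > 0.0:
                for z in range(K):
                    k[i][z] += w[z] / total
                    k[j][z] += w[z] / total
        col = [sum(k[i][z] for i in range(n)) for z in range(K)]
        theta = [[k[i][z] / math.sqrt(col[z]) if col[z] > 0 else 0.0
                  for z in range(K)] for i in range(n)]
    return log_likelihood(edges, theta, K), theta


def best_of_restarts(edges, n, K, restarts=10, seed=0):
    """Keep the run with the highest log-likelihood over random restarts."""
    rng = random.Random(seed)
    return max((em_run(edges, n, K, rng) for _ in range(restarts)),
               key=lambda result: result[0])


# Two triangles joined by a single edge (toy data).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_ll, best_theta = best_of_restarts(edges, n=6, K=2, restarts=5)
```

Each restart costs a full EM run, which is why the total running times and iteration counts reported for the test networks are sums over all initializations.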

The largest network studied, which is a network of links 
in the online community LiveJournal, is an exception to 
the pattern: for this network, which contains over 40 
million edges, we performed single runs only, with one 
random initialization each, using the pruned algorithm 
with $\delta = 0.001$ and with $\delta = 0$. The run with $\delta = 0.001$
took about 50 minutes to complete and the run with
$\delta = 0$ took about 11 hours.

While the algorithm described is fast by comparison 
with most other community detection methods, it is pos- 
sible that its speed could be improved further (or that 
the quality of the results could be improved while keep- 
ing the speed the same). Two potential improvements 
are suggested by the text processing literature discussed 
in Appendix A. The first, from Hofmann [s^l, is to use
the so-called tempered EM algorithm. The second, from 
Ding et al. ^SQ], is to alternate between the EM algo- 
rithm and a nonnegative matrix factorization algorithm, 
exploiting the fact that both maximize the same objec- 
tive function but in different ways. 



Appendix C: Nonoverlapping communities 

In Section VI we described a procedure for extracting
nonoverlapping community assignments from network
data by first finding overlapping ones and then assigning
each vertex to the community to which it belongs
most strongly. This procedure was presented as a heuristic
strategy for the nonoverlapping problem, but in this
appendix we show how it can be derived in a principled
manner as an approximation method for fitting the data
to a degree-corrected stochastic blockmodel.

Running conditions            Time (s)  Iterations    Log-likelihood

US air transportation, n = 709, m = 3327, K = 3
  naive, $\delta = 0$            15.71       55719    -8924.58
  fast, $\delta = 0$             14.67       55719    -8924.58
  fast, $\delta = 0.001$          2.17       26063    -9074.21

Network science collaborations [43], n = 379, m = 914, K = [illegible]
  naive, $\delta = 0$             0.93       13165    -3564.74
  fast, $\delta = 0$              0.82       13165    -3564.74
  fast, $\delta = 0.001$          0.13       10747    -3577.85

Network science collaborations, n = 379, m = 914, K = 10
  naive, $\delta = 0$             3.19       18246    -2602.15
  fast, $\delta = 0$              3.15       18246    -2602.15
  fast, $\delta = 0.001$          0.49       12933    -2611.96

Network science collaborations, n = 379, m = 914, K = 20
  naive, $\delta = 0$             6.16       19821    -2046.95
  fast, $\delta = 0$              6.09       19821    -2046.95
  fast, $\delta = 0.001$          0.94       14010    -2094.85

Political blogs [44], n = 1490, m = 16778, K = 2
  naive, $\delta = 0$            11.42       13773    -48761.1
  fast, $\delta = 0$             11.46       13773    -48761.1
  fast, $\delta = 0.001$          4.14       13861    -48765.6

Physics collaborations [45], n = 40421, m = 175693, K = 2
  naive, $\delta = 0$          4339.57      424077    $-1.367 \times 10^{6}$
  fast, $\delta = 0$           2557.91      424077    $-1.367 \times 10^{6}$
  fast, $\delta = 0.001$        253.41       61665    $-1.378 \times 10^{6}$

Amazon copurchasing [46], n = 403394, m = 2443408, K = 2
  naive, $\delta = 0$         170646.9     1222937    $-2.521 \times 10^{7}$
  fast, $\delta = 0$          105042.3     1222937    $-2.521 \times 10^{7}$
  fast, $\delta = 0.001$       11635.0      120612    $-2.538 \times 10^{7}$

LiveJournal [47, 48], n = 4847571, m = 42851237, K = 2
  fast, $\delta = 0$           39834.3       26163    $-4.611 \times 10^{8}$
  fast, $\delta = 0.001$        3093.7        1389    $-4.660 \times 10^{8}$

TABLE I: Example networks and running times for each of the three versions of the algorithm described in the text. The
designations "fast" and "naive" refer to the algorithm with and without pruning respectively. "Iterations" refers to the total
number of iterations for the entire run, not the average number for one random initialization. "Time" is similarly the total
running time for all initializations. Directed networks were symmetrized for these tests. All networks were run with 100 random
initializations, except for the LiveJournal network, which was run with only one random initialization.

Methods have been proposed for discovering nonover- 
lapping communities in networks by fitting to the class of 
models known as stochastic blockmodels. As discussed 
in Ref. [l^, it turns out to be crucial that the block- 
model used incorporate knowledge of the degree sequence 
of the network if it is to produce useful results, and this 
leads us to consider the so-called degree-corrected block- 
model, which can be formulated as follows. We consider 
a network of n vertices, with each vertex belonging to 
exactly one community. The community assignments 
are represented by an indicator variable $S_{ir}$ which takes
the value 1 if vertex i belongs to community r and zero
otherwise. To generate the network, we place a Poisson
distributed number of edges between each pair of
vertices i, j, such that the expected value of the adjacency
matrix element is $\theta_i \omega_{rs} \theta_j$ if vertex i belongs to
group r and vertex j belongs to group s, where $\theta_i$ and
$\omega_{rs}$ are parameters of the model. To put this another
way, the mean value of the adjacency matrix element
is $\theta_i \bigl( \sum_{rs} S_{ir} \omega_{rs} S_{js} \bigr) \theta_j$ for every vertex pair. The
normalization of the parameters is arbitrary, since we can
rescale all $\theta_i$ by the same constant if we simultaneously
rescale all $\omega_{rs}$. In our calculations we fix the normalization
so that the $\theta_i$ sum to unity within each community:
$\sum_i \theta_i S_{ir} = 1$ for all r.

Now one can fit this model to an observed network by 
writing the probability of generation of the network as 
a product of Poisson probabilities for each (multi-)edge, 
then maximizing with respect to the parameters $\theta_i$ and
$\omega_{rs}$ and the community assignments $S_{ir}$. Unfortunately,
while the maximization with respect to the continuous
parameters $\theta_i$ and $\omega_{rs}$ is a simple matter of differentiation,
the maximization with respect to the discrete variables
$S_{ir}$ is much harder. A common way around such
problems is to "relax" the discrete variables, allowing
them to take on continuous real values, so that the
optimization can be performed by differentiation. In the
present case, we would allow the $S_{ir}$ to take on arbitrary
non-negative values, subject to the constraint that
$\sum_r S_{ir} = 1$. In effect, $S_{ir}$ now represents the fraction by
which vertex i belongs to group r, with the constraint
ensuring that the fractions add correctly to 1.

With this relaxation, we can now absorb the parameters
$\theta_i$ into the $S_{ir}$, defining $\theta_{ir} = \theta_i S_{ir}$ with $\sum_i \theta_{ir} = 1$,
and the mean number of edges between vertices i and j
becomes $\sum_{rs} \theta_{ir} \omega_{rs} \theta_{js}$. This is an extended form of the
overlapping communities model studied in this paper,
generalized to include the extra $K \times K$ matrix $\omega_{rs}$. In the
language of link communities, this generalization gives us
a model in which the two ends of an edge can belong to
different communities. One can think of each end of the
edge as being colored with its own color, instead of the
whole edge taking only a single color. If $\omega_{rs}$ is constrained
to be diagonal, then we recover the single-color version
of the model again.
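The reduction to the single-color model can be written out in one line. As a worked step, assume that the model of the main text has mean $\sum_z \theta_{iz} \theta_{jz}$ (our reading of the earlier sections, which are not reproduced here) and write a diagonal matrix as $\omega_{rs} = \omega_r \delta_{rs}$, where $\omega_r$ and the rescaled parameters $\tilde\theta_{ir}$ are notation introduced only for this illustration:

```latex
\sum_{rs} \theta_{ir}\,\omega_{rs}\,\theta_{js}
  = \sum_{r} \omega_r\,\theta_{ir}\,\theta_{jr}
  = \sum_{r} \tilde\theta_{ir}\,\tilde\theta_{jr},
\qquad
\tilde\theta_{ir} = \sqrt{\omega_r}\,\theta_{ir}.
```

Under this assumption the diagonal elements simply set the overall scale of each community's parameters, so both ends of every edge carry the same color r.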

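We can fit the general (nondiagonal) model to an observed
network using an expectation-maximization algorithm,
just as before. Defining a probability $q_{ij}(r,s)$ that
an edge between i and j has colors r and s, the EM equations
are now

\[ q_{ij}(r,s) = \frac{\theta_{ir}\,\omega_{rs}\,\theta_{js}}{\sum_{rs} \theta_{ir}\,\omega_{rs}\,\theta_{js}} \tag{C1} \]

and

\[ \theta_{ir} = \frac{\sum_{js} A_{ij}\, q_{ij}(r,s)}{\sum_{ijs} A_{ij}\, q_{ij}(r,s)}, \qquad \omega_{rs} = \sum_{ij} A_{ij}\, q_{ij}(r,s). \tag{C2} \]
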
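By iterating these equations we can find a solution for
the parameters $\theta_{ir}$. But $\theta_{ir} = \theta_i S_{ir}$ and, summing both
sides over r, we get $\sum_r \theta_{ir} = \theta_i$, since $\sum_r S_{ir} = 1$. Hence

\[ S_{ir} = \frac{\theta_{ir}}{\theta_i} = \frac{\theta_{ir}}{\sum_{s} \theta_{is}}. \tag{C3} \]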



Thus we can calculate the values of $S_{ir}$ and once we have
these we can then reverse the relaxation of the model by
rounding the values to zero or one, which is equivalent to
assigning each vertex i to the community r for which $S_{ir}$
is largest, or equivalently the community for which $\theta_{ir}$ is
largest.
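The procedure can be sketched in Python as follows. The updates used here are the ones obtained by maximizing the Poisson likelihood of the relaxed model under the normalization in which the $\theta_{ir}$ sum to one over vertices for each r, namely $\theta_{ir} \propto \sum_{js} A_{ij} q_{ij}(r,s)$ and $\omega_{rs} = \sum_{ij} A_{ij} q_{ij}(r,s)$; we state this as an assumption of the sketch rather than as a transcription of the equations. The function names, the fixed iteration count, and the random starting point are likewise hypothetical choices.

```python
import random


def generalized_em(edges, n, K, iterations=200, seed=0):
    """EM for mean sum_rs theta_ir * omega_rs * theta_js; returns (theta, omega)."""
    rng = random.Random(seed)
    theta = [[rng.random() for _ in range(K)] for _ in range(n)]
    omega = [[1.0] * K for _ in range(K)]
    # Treat every undirected edge as the two ordered pairs (i, j) and (j, i).
    pairs = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    for _ in range(iterations):
        num = [[0.0] * K for _ in range(n)]        # sum_js A_ij q_ij(r, s)
        new_omega = [[0.0] * K for _ in range(K)]  # sum_ij A_ij q_ij(r, s)
        for i, j in pairs:
            q = [[theta[i][r] * omega[r][s] * theta[j][s] for s in range(K)]
                 for r in range(K)]
            total = sum(map(sum, q))
            if total == 0.0:
                continue
            for r in range(K):
                for s in range(K):
                    share = q[r][s] / total  # posterior for colors (r, s)
                    num[i][r] += share
                    new_omega[r][s] += share
        norm = [sum(num[i][r] for i in range(n)) for r in range(K)]
        theta = [[num[i][r] / norm[r] if norm[r] > 0 else 0.0
                  for r in range(K)] for i in range(n)]
        omega = new_omega
    return theta, omega


def round_to_communities(theta):
    """Assign each vertex to the community r with the largest theta_ir."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


# Two triangles joined by a single edge (toy data).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
theta, omega = generalized_em(edges, n=6, K=2)
labels = round_to_communities(theta)
```

Since every edge contributes total weight one to each of its two ordered pairs, the elements of the fitted matrix sum to twice the number of edges, which gives a quick sanity check on an implementation.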

Thus the final algorithm for dividing the network is
simply to iterate the EM equations to convergence and
then assign each vertex to the community for which $\theta_{ir}$
is largest. This is precisely the algorithm that we used
in Section VI, except that the model is generalized to
include the matrix $\omega_{rs}$, where in our original calculations
this matrix was absent, which is equivalent to assuming
it to be diagonal. In our experiments, however, we have
found that even when we allow $\omega_{rs}$ to be nondiagonal, the
algorithm commonly chooses a diagonal value anyway,
which implies that the output of our original algorithm
and the generalized algorithm should be the same. (We
note that in practice the diagonal version of the algorithm
runs faster, and that both are substantially faster than
the vertex moving heuristic proposed for the stochastic
blockmodel in Ref. 

It is entirely possible, however, that there could be 
networks with interesting nondiagonal group structure 
that could be detected using the more general model. 
The model including the matrix $\omega_{rs}$ can in principle find
disassortative community structure — structure in which 
connections are less common within communities than 
between them — as well as the better studied assortative 
structure. For example, the model can detect bipartite 
structure in networks, whereas the unadjusted model cannot.